lubridate for working with dates in R and ggplot2library(tidyverse)
library(ggthemes)
library(knitr)
library(broom)
library(stringr)
library(modelr)
options(digits = 3)
set.seed(1234)
theme_set(theme_minimal())For more details on the
lubridatepackage, check out R for Data Science.
When importing data, date variables can be somewhat tricky to correctly store and utilize. In a spreadsheet, tabular format, dates by default will appear either as numeric (20174018) or string (2016-04-18, April 18th, 2017, etc.) columns. If you want to perform tasks such as: extracting and summarizing over individual components (year, month, day, etc.), we need to represent dates in a different, yet standardized, format.
lubridate is a tidyverse package that facilitates working with dates (and date-times) in R.
library(lubridate)##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
When using readr to import data files, R will use parse_date() or parse_datetime() to try and format any columns it thinks contain dates or date-times. To manually format dates from strings, use the appropriate function combining y, m, and d in the proper order depending on the original format of the date:
ymd("2017-01-31")## [1] "2017-01-31"
mdy("January 31st, 2017")## [1] "2017-01-31"
dmy("31-Jan-2017")## [1] "2017-01-31"
Let’s practice extracting components of dates using an example dataset. flights-departed.csv is a time series data file containing the daily number of departing commercial flights in the United States from 1988-2008.
(flights <- read_csv("data/flights-departed.csv"))## Parsed with column specification:
## cols(
## date = col_date(format = ""),
## value = col_integer()
## )
## # A tibble: 7,671 × 2
## date value
## <date> <int>
## 1 1988-01-01 12681
## 2 1988-01-02 13264
## 3 1988-01-03 13953
## 4 1988-01-04 13921
## 5 1988-01-05 13932
## 6 1988-01-06 13157
## 7 1988-01-07 11159
## 8 1988-01-08 11631
## 9 1988-01-09 12045
## 10 1988-01-10 13160
## # ... with 7,661 more rows
We will use ggplot2 to generate several graphs based on the data. The first is a simple line plot over time of the daily commercial flights. To build this, we don’t need to modify flights:
ggplot(flights, aes(date, value)) +
geom_line() +
labs(x = NULL,
y = "Number of departing commercial flights")But this is quite noisy. Instead, let’s draw a line plot depicting commercial flights over a one-year period, with separate lines for each year in the data (1988, 1989, 1990, etc.). To do that, we need to create a new variable year which will serve as our grouping variable in ggplot():
(flights <- flights %>%
mutate(year = year(date),
yday = yday(date),
# hack to label the x-axis with months
days = dmy(format(date,"%d-%m-2016"))))## # A tibble: 7,671 × 5
## date value year yday days
## <date> <int> <dbl> <dbl> <date>
## 1 1988-01-01 12681 1988 1 2016-01-01
## 2 1988-01-02 13264 1988 2 2016-01-02
## 3 1988-01-03 13953 1988 3 2016-01-03
## 4 1988-01-04 13921 1988 4 2016-01-04
## 5 1988-01-05 13932 1988 5 2016-01-05
## 6 1988-01-06 13157 1988 6 2016-01-06
## 7 1988-01-07 11159 1988 7 2016-01-07
## 8 1988-01-08 11631 1988 8 2016-01-08
## 9 1988-01-09 12045 1988 9 2016-01-09
## 10 1988-01-10 13160 1988 10 2016-01-10
## # ... with 7,661 more rows
ggplot(flights, aes(days, value)) +
geom_line(aes(group = year), alpha = .2) +
geom_smooth(se = FALSE) +
scale_x_date(labels = scales::date_format("%b")) +
labs(x = NULL,
y = "Number of departing commercial flights")## `geom_smooth()` using method = 'gam'
Or we could summarize the distribution of departing commercial flights by days in each month over the 20 year time period:
(flights <- flights %>%
mutate(month = month(date, label = TRUE)))## # A tibble: 7,671 × 6
## date value year yday days month
## <date> <int> <dbl> <dbl> <date> <ord>
## 1 1988-01-01 12681 1988 1 2016-01-01 Jan
## 2 1988-01-02 13264 1988 2 2016-01-02 Jan
## 3 1988-01-03 13953 1988 3 2016-01-03 Jan
## 4 1988-01-04 13921 1988 4 2016-01-04 Jan
## 5 1988-01-05 13932 1988 5 2016-01-05 Jan
## 6 1988-01-06 13157 1988 6 2016-01-06 Jan
## 7 1988-01-07 11159 1988 7 2016-01-07 Jan
## 8 1988-01-08 11631 1988 8 2016-01-08 Jan
## 9 1988-01-09 12045 1988 9 2016-01-09 Jan
## 10 1988-01-10 13160 1988 10 2016-01-10 Jan
## # ... with 7,661 more rows
ggplot(flights, aes(month, value)) +
geom_violin() +
geom_boxplot(width = .1, outlier.shape = NA) +
labs(x = NULL,
y = "Number of departing commercial flights")Hmmm, there seems to be an outlier in September. What’s up with that?
Finally, we can generate a heatmap depicting the change over time of this data by creating a calendar-like visualization.1 In order do this, we need the following grammar for the graph:
value (number of departing flights)identitygeom_tile()facet_grid() (year X month)In order to generate this graph then, we need to create several new variables for flights:
We can use lubridate to directly generate three of those variables (we’ve already generated year and month):
(flights <- flights %>%
mutate(weekday = wday(date, label = TRUE)))## # A tibble: 7,671 × 7
## date value year yday days month weekday
## <date> <int> <dbl> <dbl> <date> <ord> <ord>
## 1 1988-01-01 12681 1988 1 2016-01-01 Jan Fri
## 2 1988-01-02 13264 1988 2 2016-01-02 Jan Sat
## 3 1988-01-03 13953 1988 3 2016-01-03 Jan Sun
## 4 1988-01-04 13921 1988 4 2016-01-04 Jan Mon
## 5 1988-01-05 13932 1988 5 2016-01-05 Jan Tues
## 6 1988-01-06 13157 1988 6 2016-01-06 Jan Wed
## 7 1988-01-07 11159 1988 7 2016-01-07 Jan Thurs
## 8 1988-01-08 11631 1988 8 2016-01-08 Jan Fri
## 9 1988-01-09 12045 1988 9 2016-01-09 Jan Sat
## 10 1988-01-10 13160 1988 10 2016-01-10 Jan Sun
## # ... with 7,661 more rows
We use label = TRUE to generate factor labels for these values (January, February, March) instead of the numeric equivalent (1, 2, 3).
To generate the final week-in-month variable, we need to combine a few lubridate functions to get exactly what we want:
(flights <- flights %>%
# generate variables for week in the year (1-54) and the day in the year (1-366)
mutate(week = week(date),
yday = yday(date)) %>%
# normalize to draw calendar correctly - wday should represent the number of days from the Sunday of the week containing January 1st, then adjust based on that
group_by(year) %>%
mutate(yday = yday + wday(date)[1] - 2,
week = floor(yday / 7)) %>%
group_by(year, month) %>%
mutate(week_month = week - min(week) + 1))## Source: local data frame [7,671 x 9]
## Groups: year, month [252]
##
## date value year yday days month weekday week week_month
## <date> <int> <dbl> <dbl> <date> <ord> <ord> <dbl> <dbl>
## 1 1988-01-01 12681 1988 5 2016-01-01 Jan Fri 0 1
## 2 1988-01-02 13264 1988 6 2016-01-02 Jan Sat 0 1
## 3 1988-01-03 13953 1988 7 2016-01-03 Jan Sun 1 2
## 4 1988-01-04 13921 1988 8 2016-01-04 Jan Mon 1 2
## 5 1988-01-05 13932 1988 9 2016-01-05 Jan Tues 1 2
## 6 1988-01-06 13157 1988 10 2016-01-06 Jan Wed 1 2
## 7 1988-01-07 11159 1988 11 2016-01-07 Jan Thurs 1 2
## 8 1988-01-08 11631 1988 12 2016-01-08 Jan Fri 1 2
## 9 1988-01-09 12045 1988 13 2016-01-09 Jan Sat 1 2
## 10 1988-01-10 13160 1988 14 2016-01-10 Jan Sun 2 3
## # ... with 7,661 more rows
Now that we have the data correctly formatted and all the components are extracted, we can draw the graph:
ggplot(flights, aes(weekday, week_month, fill = value)) +
facet_grid(year ~ month) +
geom_tile(color = "black") +
scale_fill_continuous(low = "green", high = "red") +
scale_x_discrete(labels = NULL) +
scale_y_reverse(labels = NULL) +
labs(title = "Domestic commercial flight activity",
x = NULL,
y = NULL,
fill = "Number of departing flights") +
theme_void() +
theme(legend.position = "bottom")Aha, now the outlier makes sense. In the days following the September 11th attacks, the United States grounded virtually all commercial air traffic.
When examining multivariate continuous data, scatterplots are a quick and easy visualization to assess relationships. However if the data points become too densely clustered, interpreting the graph becomes difficult. Consider the diamonds dataset:
p <- ggplot(diamonds, aes(carat, price)) +
geom_point() +
scale_y_continuous(labels = scales::dollar) +
labs(x = "Carat size",
y = "Price")
pWhat is the relationship between carat size and price? It appears positive, but there are also a lot of densely packed data points in the middle of the graph. Smoothing lines are a method for summarizing the relationship between variables to capture important patterns by approximating the functional form of the relationship. The functional form can take on many shapes. For instance, a very common functional form is a best-fit line, also known as ordinary least squares (OLS) or simple linear regression. We can estimate the model directly using lm(), or we can directly plot the line by using geom_smooth(method = "lm"):
p +
geom_smooth(method = "lm", se = FALSE)The downside to a linear best-fit line is that it assumes the relationship between the variables is additive and monotonic. Therefore the summarized relationship between carat size and price seems wildly incorrect for diamonds with a carat size larger than 3. Instead we could use a generalized additive model which allow for flexible, non-linear relationships between the variables while still implementing a basic regression approach:2
p +
geom_smooth(se = FALSE)## `geom_smooth()` using method = 'gam'
Locally weighted scatterplot smoothing (local regression, LOWESS, or LOESS) fits a separate non-linear function at each target point \(x_0\) using only the nearby training observations. This method estimates a regression line based on localized subsets of the data, building up the global function \(f\) point-by-point.
Here is an example of a local linear regression on the ethanol dataset in the lattice package:
The LOESS is built up point-by-point:
One important argument you can control with LOESS is the span, or how smooth the LOESS function will become. A larger span will result in a smoother curve, but may not be as accurate. A smaller span will be more local and wiggly, but improve our fit to the training data.
LOESS lines are best used for datasets with fewer than 1000 observations, otherwise the time and memory usage required to compute the line increases exponentially.
r_plot <- function(r, n = 100){
xy <- ecodist::corgen(len = n, r = r) %>%
bind_cols
ggplot(xy, aes(x, y)) +
geom_point() +
ggtitle(str_c("Pearson's r = ", r))
}
r <- c(.8, 0, -.8)
for(r in r){
print(r_plot(r))
}stat_smooth() methodsdevtools::session_info()## Session info --------------------------------------------------------------
## setting value
## version R version 3.3.3 (2017-03-06)
## system x86_64, darwin13.4.0
## ui X11
## language (EN)
## collate en_US.UTF-8
## tz America/Chicago
## date 2017-04-18
## Packages ------------------------------------------------------------------
## package * version date source
## assertthat 0.1 2013-12-06 CRAN (R 3.3.0)
## backports 1.0.5 2017-01-18 CRAN (R 3.3.2)
## broom * 0.4.2 2017-02-13 CRAN (R 3.3.2)
## colorspace 1.3-2 2016-12-14 CRAN (R 3.3.2)
## DBI 0.6 2017-03-09 CRAN (R 3.3.3)
## devtools 1.12.0 2016-06-24 CRAN (R 3.3.0)
## digest 0.6.12 2017-01-27 CRAN (R 3.3.2)
## dplyr * 0.5.0 2016-06-24 CRAN (R 3.3.0)
## evaluate 0.10 2016-10-11 CRAN (R 3.3.0)
## forcats 0.2.0 2017-01-23 CRAN (R 3.3.2)
## foreign 0.8-67 2016-09-13 CRAN (R 3.3.3)
## ggplot2 * 2.2.1 2016-12-30 CRAN (R 3.3.2)
## ggthemes * 3.4.0 2017-02-19 CRAN (R 3.3.2)
## gtable 0.2.0 2016-02-26 CRAN (R 3.3.0)
## haven 1.0.0 2016-09-23 cran (@1.0.0)
## hms 0.3 2016-11-22 CRAN (R 3.3.2)
## htmltools 0.3.5 2016-03-21 CRAN (R 3.3.0)
## httr 1.2.1 2016-07-03 CRAN (R 3.3.0)
## jsonlite 1.3 2017-02-28 CRAN (R 3.3.2)
## knitr * 1.15.1 2016-11-22 cran (@1.15.1)
## lattice 0.20-34 2016-09-06 CRAN (R 3.3.3)
## lazyeval 0.2.0 2016-06-12 CRAN (R 3.3.0)
## lubridate 1.6.0 2016-09-13 CRAN (R 3.3.0)
## magrittr 1.5 2014-11-22 CRAN (R 3.3.0)
## memoise 1.0.0 2016-01-29 CRAN (R 3.3.0)
## mnormt 1.5-5 2016-10-15 CRAN (R 3.3.0)
## modelr * 0.1.0 2016-08-31 CRAN (R 3.3.0)
## munsell 0.4.3 2016-02-13 CRAN (R 3.3.0)
## nlme 3.1-131 2017-02-06 CRAN (R 3.3.3)
## plyr 1.8.4 2016-06-08 CRAN (R 3.3.0)
## psych 1.7.3.21 2017-03-22 CRAN (R 3.3.2)
## purrr * 0.2.2 2016-06-18 CRAN (R 3.3.0)
## R6 2.2.0 2016-10-05 CRAN (R 3.3.0)
## Rcpp 0.12.10 2017-03-19 cran (@0.12.10)
## readr * 1.1.0 2017-03-22 cran (@1.1.0)
## readxl 0.1.1 2016-03-28 CRAN (R 3.3.0)
## reshape2 1.4.2 2016-10-22 CRAN (R 3.3.0)
## rmarkdown 1.3 2016-12-21 CRAN (R 3.3.2)
## rprojroot 1.2 2017-01-16 CRAN (R 3.3.2)
## rvest 0.3.2 2016-06-17 CRAN (R 3.3.0)
## scales 0.4.1 2016-11-09 CRAN (R 3.3.1)
## stringi 1.1.2 2016-10-01 CRAN (R 3.3.0)
## stringr * 1.2.0 2017-02-18 CRAN (R 3.3.2)
## tibble * 1.2 2016-08-26 cran (@1.2)
## tidyr * 0.6.1 2017-01-10 CRAN (R 3.3.2)
## tidyverse * 1.1.1 2017-01-27 CRAN (R 3.3.2)
## withr 1.0.2 2016-06-20 CRAN (R 3.3.0)
## xml2 1.1.1 2017-01-24 CRAN (R 3.3.2)
## yaml 2.1.14 2016-11-12 cran (@2.1.14)
geom_smooth() automatically implements the gam method for datasets with greater than 1000 observations.↩